Pruning and quantization for deep neural network acceleration: A survey

Abstract

Deep neural networks have been applied in many applications, exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy; in some cases accuracy may even improve. This paper provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise, and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations may be quantized, typically to 8-bit integers, although lower-bit-width implementations are also discussed, including binary neural networks. Both pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.
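
As a concrete illustration of the two techniques surveyed, the following minimal NumPy sketch (ours, not code from the paper; the function names, the 50% sparsity target, and the symmetric scaling are illustrative assumptions) applies static element-wise magnitude pruning followed by symmetric 8-bit weight quantization:

```python
# Illustrative sketch, not from the survey: element-wise magnitude pruning
# followed by symmetric uniform 8-bit quantization of a weight tensor.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold   # keep only weights above the cutoff
    return weights * mask

def quantize_int8(weights: np.ndarray):
    """Symmetric uniform quantization to signed 8-bit integers."""
    max_abs = np.abs(weights).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize with q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)
    pruned = magnitude_prune(w, sparsity=0.5)   # static, element-wise pruning
    q, scale = quantize_int8(pruned)            # 8-bit weight quantization
    recovered = q.astype(np.float32) * scale
    print("sparsity:", np.mean(pruned == 0))
    print("max quantization error:", np.abs(recovered - pruned).max())
```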

Similar articles

A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline: pruning, quantization, and Huffman encoding. These stages work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy. Our method...

Automated Pruning for Deep Neural Network Compression

In this work we present a method to improve the pruning step of the current state-of-the-art methodology for compressing neural networks. The novelty of the proposed pruning technique lies in its differentiability, which allows pruning to be performed during the backpropagation phase of network training. This enables end-to-end learning and strongly reduces the training time. The technique is ...
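
The abstract does not spell out the formulation, but the core idea of differentiable pruning can be sketched with a learnable soft gate per output unit, as in the hypothetical PyTorch snippet below; GatedLinear, the sigmoid relaxation, and the L1 gate penalty are our illustrative choices, not necessarily the authors' method.

```python
# Hypothetical sketch of differentiable pruning: each output unit gets a
# learnable gate, relaxed to a sigmoid so the pruning decision receives
# gradients during ordinary backpropagation.
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # One gate logit per output unit; large positive init keeps all units on.
        self.gate_logits = nn.Parameter(torch.full((out_features,), 3.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.gate_logits)   # soft mask in (0, 1)
        return self.linear(x) * gates             # gated outputs

    def sparsity_penalty(self) -> torch.Tensor:
        # L1 on the gates pushes them toward zero; units whose gate
        # collapses can be removed after training (hard pruning).
        return torch.sigmoid(self.gate_logits).sum()

layer = GatedLinear(64, 32)
x = torch.randn(8, 64)
loss = layer(x).pow(2).mean() + 1e-3 * layer.sparsity_penalty()
loss.backward()   # gradients flow to both the weights and the pruning gates
```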

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

Deep Compression is a three-stage compression pipeline: pruning, quantization, and Huffman coding. Pruning reduces the number of weights by 10×; quantization further improves the compression rate to between 27× and 31×; Huffman coding increases it to between 35× and 49×. The compression rate already includes the metadata for the sparse representation. Deep Compression doesn't incur loss of accu...
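
To make the Huffman stage concrete, here is a small self-contained Python sketch (ours, with made-up, deliberately skewed index counts rather than real model statistics) that computes Huffman code lengths for quantized-weight cluster indices and compares the coded size against a fixed 4-bit encoding:

```python
# Back-of-the-envelope sketch, not the paper's pipeline code: Huffman-code
# the cluster indices of quantized weights and compare against fixed width.
import heapq
from collections import Counter

def huffman_lengths(symbols):
    """Code length in bits for each symbol under a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate case: one symbol
        return {next(iter(counts)): 1}
    lengths = {sym: 0 for sym in counts}
    heap = [(c, i, [sym]) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:              # every symbol in the merged
            lengths[sym] += 1                  # subtree gets one bit deeper
        heapq.heappush(heap, (c1 + c2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

# Example: 4-bit cluster indices for quantized weights, heavily skewed
# toward a few clusters (typical after pruning and weight sharing).
indices = [0] * 700 + [1] * 200 + [2] * 60 + [3] * 40
lengths = huffman_lengths(indices)
counts = Counter(indices)
huffman_bits = sum(counts[s] * lengths[s] for s in counts)
fixed_bits = len(indices) * 4                  # fixed 4-bit encoding
print(f"fixed: {fixed_bits} bits, huffman: {huffman_bits} bits, "
      f"ratio: {fixed_bits / huffman_bits:.2f}x")
```

Because pruning and weight sharing leave a few cluster indices dominating the distribution, the variable-length code beats the fixed-width one; this skew is what the additional compression from the Huffman stage relies on.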

Adaptive Quantization for Deep Neural Network

In recent years, Deep Neural Networks (DNNs) have been rapidly developed for various applications, with increasingly complex architectures. The performance gains of these DNNs generally come with high computational costs and large memory consumption, which may not be affordable for mobile platforms. Deep model quantization can be used for reducing the computation and memory costs of DNNs...

Neural Network Pruning and Pruning Parameters

The default multilayer neural network topology is a fully interlayer-connected one. This simplistic choice facilitates the design but limits the performance of the resulting neural networks. The best-known methods for obtaining partially connected neural networks are the so-called pruning methods, which are used for optimizing both the size and the generalization capabilities of neural networ...

Journal

Journal title: Neurocomputing

Year: 2021

ISSN: 0925-2312, 1872-8286

DOI: https://doi.org/10.1016/j.neucom.2021.07.045